Introduction

To investigate ride behavior differences between casual and member users and uncover temporal and spatial patterns in ride activity, a comprehensive and well-structured database is essential. The analysis focuses on understanding how ride patterns vary across time (daily, weekly, and seasonally) and space (stations and routes), while identifying trends in ride duration, station popularity, and overall demand. These insights are critical for guiding Divvy’s operational decisions and marketing strategies.

The source data for this project consists of 12 monthly Divvy trip datasets for the year 2024, containing ride-level information such as ride identifiers, timestamps, start and end stations, and user type (casual vs. member). To efficiently support analysis, a relational database will be designed to:

  • Consolidate the monthly datasets into a single, queryable structure.
  • Maintain data integrity with primary keys and appropriate data types for timestamps, text fields, and identifiers.
  • Enable temporal analysis by storing ride start and end times in a standardized timestamp format.
  • Support spatial analysis by including station names and IDs, allowing examination of station popularity and route patterns.
  • Facilitate user segmentation by distinguishing between casual and member riders.

By implementing this database, analysts will be able to efficiently query and aggregate data, uncover patterns in ride behavior, and generate actionable insights for Divvy’s operational planning and marketing initiatives.
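
As an illustration of the temporal and segmentation analyses this design supports, a query along these lines could count rides per weekday split by user type. Note that the consolidated table name divvy.rides is an assumption here; the consolidated structure is built later from the monthly tables.

```sql
-- Rides per weekday, split by user type (divvy.rides is an assumed name)
SELECT
  to_char(started_at, 'Day') AS weekday,
  member_casual,
  COUNT(*)                   AS rides,
  AVG(ended_at - started_at) AS avg_duration
FROM divvy.rides
GROUP BY 1, 2
ORDER BY 1, 2;
```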


Database Creation

Database connection

# Required packages
library(DBI)        # generic database interface (dbConnect, dbExecute, ...)
library(RPostgres)  # Postgres() driver
library(ini)        # read.ini() for the config file
library(glue)       # glue() for SQL templating

# Read config
config <- read.ini("resources/db_config.ini")
db <- config$postgresql

# Safe database connection
tryCatch({
  con <- dbConnect(
    Postgres(),
    host = db$host,
    dbname = db$database,
    user = db$user,
    password = db$password,
    port = as.integer(db$port)
  )
}, error = function(e) {
  stop("Database connection failed: ", e$message)
})

# Register connection for SQL chunks
knitr::opts_chunk$set(connection = con)

Prerequisites

Enable the pgcrypto extension for UUID generation

CREATE EXTENSION IF NOT EXISTS pgcrypto;
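
As a quick sanity check, the extension’s gen_random_uuid() function can be called directly. (On PostgreSQL 13 and later this function is also available in core, without pgcrypto.)

```sql
-- Returns one randomly generated version-4 UUID
SELECT gen_random_uuid();
```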

Enable the btree_gin extension, which lets B-tree-indexable scalar types (text, timestamps, etc.) participate in composite GIN indexes

CREATE EXTENSION IF NOT EXISTS btree_gin;
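
A sketch of the kind of composite index btree_gin enables, assuming a consolidated table divvy.rides with the same columns as the staging tables (the table name and index name are assumptions for illustration):

```sql
-- Composite GIN index mixing a text column and a timestamp column;
-- without btree_gin, GIN cannot index these scalar types.
CREATE INDEX IF NOT EXISTS rides_member_started_gin
  ON divvy.rides USING gin (member_casual, started_at);
```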

Schema & Base Tables

Create a schema to hold all Divvy tables

CREATE SCHEMA IF NOT EXISTS divvy;

Create monthly staging tables. Staging tables mirror the CSV columns exactly. They’re fast to load and easy to QA. We’ll later upsert into a normalized fact table.

months <- c("january","february","march","april","may","june",
            "july","august","september","october","november","december")

for (m in months) {
  sql <- glue("
    CREATE TABLE IF NOT EXISTS divvy.{m} (
      ride_id             TEXT PRIMARY KEY,
      rideable_type       TEXT,
      started_at          TIMESTAMP,
      ended_at            TIMESTAMP,
      start_station_name  TEXT,
      start_station_id    TEXT,
      end_station_name    TEXT,
      end_station_id      TEXT,
      member_casual       TEXT
    );
  ")
  dbExecute(con, sql)
}
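
The upsert into a normalized fact table mentioned above can be sketched as follows. The table name divvy.rides and the LIKE-based definition are assumptions, not part of the original workflow; LIKE ... INCLUDING ALL copies the columns, constraints, and indexes (including the ride_id primary key) from a staging table.

```sql
-- Fact table with the same columns and constraints as a staging table
CREATE TABLE IF NOT EXISTS divvy.rides
  (LIKE divvy.january INCLUDING ALL);

-- Idempotent upsert from one monthly staging table:
-- the ride_id primary key makes re-runs safe.
INSERT INTO divvy.rides
SELECT * FROM divvy.january
ON CONFLICT (ride_id) DO NOTHING;
```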

Load the CSVs

# month name for a given numeric month
# (note: "%B" is locale-dependent; this assumes an English locale,
#  matching the staging table names created above)
month_name <- function(m) tolower(
  format(as.Date(paste0("2024-", sprintf("%02d", m), "-01")), "%B"))

# loop and load
for (m in 1:12) {
  fname <- sprintf("resources/data/2024%02d-divvy-tripdata.csv", m)
  tbl   <- month_name(m)

  message("Loading: ", fname, " -> divvy.", tbl)
  df <- readr::read_csv(fname, show_col_types = FALSE)

  # Optional: select/rename only the columns we expect
  expect <- c("ride_id","rideable_type","started_at","ended_at",
              "start_station_name","start_station_id",
              "end_station_name","end_station_id","member_casual")
  df <- df[, intersect(expect, names(df))]

  DBI::dbWriteTable(
    con,
    name = DBI::Id(schema="divvy", table=tbl),
    value = df,
    append = TRUE,     # append to existing monthly table
    row.names = FALSE
  )
}
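
After loading, a quick QA query can confirm row counts and flag suspect records. This sketch checks a single month; the same pattern can be repeated (or combined with UNION ALL) for the other staging tables.

```sql
-- Basic load QA for one staging table
SELECT
  COUNT(*)                                         AS total_rows,
  COUNT(*) FILTER (WHERE start_station_id IS NULL) AS missing_start_station,
  COUNT(*) FILTER (WHERE ended_at < started_at)    AS negative_durations
FROM divvy.january;
```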
